Genetic association studies for gene expressions: permutation-based mutual information in a comparison with standard ANOVA and as a novel approach for feature selection

نویسندگان

  • Silke Szymczak
  • Angelo Nuzzo
  • Christian Fuchsberger
  • Daniel F Schwarz
  • Andreas Ziegler
  • Riccardo Bellazzi
  • Bernd-Wolfgang Igl
چکیده

Mutual information (MI) is a robust nonparametric statistical approach for identifying associations between genotypes and gene expression levels. Using the data of Problem 1 provided for the Genetic Analysis Workshop 15, we first compared a quantitative MI (Tsalenko et al. 2006 J Bioinform Comput Biol 4:259-4) with the standard analysis of variance (ANOVA) and the nonparametric Kruskal-Wallis (KW) test. We then proposed a novel feature selection approach using MI in a classification scenario to address the small n - large p problem and compared it with a feature selection that relies on an asymptotic chi2 distribution. In both applications, we used a permutation-based approach for evaluating the significance of MI. Substantial discrepancies in significance were observed between MI, ANOVA, and KW that can be explained by different empirical distributions of the data. In contrast to ANOVA and KW, MI detects shifts in location when the data are non-normally distributed, skewed, or contaminated with outliers. ANOVA but not MI is often significant if one genotype with a small frequency had a remarkable difference in the average gene expression level relative to the other two genotypes. MI depends on genotype frequencies and cannot detect these differences. In the classification scenario, we show that our novel approach for feature selection identifies a smaller list of markers with higher accuracy compared to the standard method. In conclusion, permutation-based MI approaches provide reliable and flexible statistical frameworks which seem to be well suited for data that are non-normal, skewed, or have an otherwise peculiar distribution. They merit further methodological investigation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Integration and Reduction of Microarray Gene Expressions Using an Information Theory Approach

The DNA microarray is an important technique that allows researchers to analyze many gene expression data in parallel. Although the data can be more significant if they come out of separate experiments, one of the most challenging phases in the microarray context is the integration of separate expression level datasets that have gathered through different techniques. In this paper, we prese...

متن کامل

A Novel Intrusion Detection Systems based on Genetic Algorithms-suggested Features by the Means of Different Permutations of Labels’ Orders

Intrusion detection systems (IDS) by exploiting Machine learning techniques are able to diagnose attack traffics behaviors. Because of relatively large numbers of features in IDS standard benchmark dataset, like KDD CUP 99 and NSL_KDD, features selection methods play an important role. Optimization algorithms like Genetic algorithms (GA) are capable of finding near-optimum combination of the fe...

متن کامل

A Novel Subsampling Method for 3D Multimodality Medical Image Registration Based on Mutual Information

Mutual information (MI) is a widely used similarity metric for multimodality image registration. However, it involves an extremely high computational time especially when it is applied to volume images. Moreover, its robustness is affected by existence of local maxima. The multi-resolution pyramid approaches have been proposed to speed up the registration process and increase the accuracy of th...

متن کامل

A Novel Hybrid Approach for Email Spam Detection based on Scatter Search Algorithm and K-Nearest Neighbors

Because cyberspace and Internet predominate in the life of users, in addition to business opportunities and time reductions, threats like information theft, penetration into systems, etc. are included in the field of hardware and software. Security is the top priority to prevent a cyber-attack that users should initially be detecting the type of attacks because virtual environments are not moni...

متن کامل

Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine

Different approaches have been proposed for feature selection to obtain suitable features subset among all features. These methods search feature space for feature subsets which satisfies some criteria or optimizes several objective functions. The objective functions are divided into two main groups: filter and wrapper methods.  In filter methods, features subsets are selected due to some measu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • BMC Proceedings

دوره 1  شماره 

صفحات  -

تاریخ انتشار 2007